R Coding Club
RTG 2660
2024-04-30
Often in our research, we have repeated measures: We collect more than one data point from each subject.
Advantage: increased power and precision.
Disadvantage: violates iid (independent and identically distributed) assumption of ordinary least squares methods (regression, ANOVA).
independent: the probability of a data point taking on a specific value is independent of the values taken by other data points.
identically distributed: all data points are samples from the same underlying distribution.
Problem: Observations from the same subject are usually correlated (i.e. more likely to be similar to each other than to other observations).
–> this also applies to other forms of dependency: if units of observations are clustered, e.g. mice in cage/handled by experimenter, students in classroom…
In contrast to violations of other assumptions (normality, homoscedasticity), ANOVA is not robust against this violation! –> inflated Type I error rate (more false positives)
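A quick simulation makes the inflation visible. This is a sketch with made-up parameters: two groups of 10 subjects, 5 correlated observations per subject, and no true group effect. A naive test that treats all observations as independent rejects far more often than the nominal 5%:

```r
# Sketch: ignoring within-subject correlation inflates false positives.
# All parameter values here are invented for illustration.
set.seed(1)

n_sims <- 2000
n_subj <- 10   # subjects per group
n_obs  <- 5    # repeated observations per subject
p_values <- numeric(n_sims)

for (i in seq_len(n_sims)) {
  # subject-level random intercepts induce correlation within subjects
  subj_effect <- rnorm(2 * n_subj, sd = 1)
  y <- rep(subj_effect, each = n_obs) + rnorm(2 * n_subj * n_obs, sd = 1)
  group <- rep(c("A", "B"), each = n_subj * n_obs)  # no true group effect
  # naive t-test treating all 100 observations as independent
  p_values[i] <- t.test(y ~ group)$p.value
}

# nominal alpha is 0.05; the observed false-positive rate is much higher
mean(p_values < 0.05)
```

With these settings the intraclass correlation is 0.5, so the effective sample size is much smaller than the number of rows, and the false-positive rate lands well above 0.05.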
repeated measures: each subject provides several data points.
F-statistic is calculated a bit differently: \[ F = MS_{conditions} / MS_{error} \]
Imagine we held a wine tasting where every one of you tries six different wines and rates each on a scale from 1–10. We would get the following data set:
| Subject | Wine1 | Wine2 | Wine3 | Wine4 | Wine5 | Wine6 | Mean_subject |
|---|---|---|---|---|---|---|---|
| Ecem | 5 | 6 | 7 | 6 | 3 | 5 | 5.333 |
| Ulkar | 7 | 8 | 8 | 7 | 6 | 6 | 7 |
| Francesco | 1 | 4 | 3 | 7 | 4 | 4 | 3.833 |
| Anna | 6 | 4 | 6 | 5 | 3 | 4 | 4.667 |
| Nikita | 8 | 7 | 4 | 6 | 9 | 3 | 6.167 |
| Zora | 2 | 3 | 2 | 10 | 4 | 3 | 4 |
| Mean_wine | 4.833 | 5.333 | 5 | 6.833 | 4.833 | 4.167 | 5.167 |
\[ SS_{wine} = \sum_{i=1}^{n}n_i(\bar{x_i}-\bar{x})^2 \]
For every wine, we subtract the grand mean from that wine’s mean, square the difference, sum over the wines, and multiply by the number of subjects.
\[ SS_{within} = \sum_{j=1}^{k}\sum_{i=1}^{n}(x_{ij}-\bar{x_j})^2 \]
For the SSwithin, we subtract each wine’s group mean from the individual data points, square the differences, and sum them up over all wines.
\[ SS_{subjects} = k\sum_{i=1}^{n}(\bar{x_i}-\bar{x})^2 \]
Here, we take each subject’s mean, subtract the grand mean from it, square the difference, and multiply the sum by the number of wines. \[SS_{error} = SS_{within} - SS_{subjects} \]
We can then calculate the Mean Squares by dividing by the degrees of freedom. The MS are then used to calculate the F-value:
\[ MS_{wine} = \frac{SS_{wine}}{(k-1)} \]
\[ MS_{error} = \frac{SS_{error}}{(n-1)(k-1)} \]
\[F=\frac{MS_{wine}}{MS_{error}}\]
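These formulas can be checked by hand in base R using the wine ratings from the table above (a sketch; the variable names are my own):

```r
# wine ratings from the table above: 6 subjects x 6 wines
ratings <- matrix(c(5, 6, 7,  6, 3, 5,
                    7, 8, 8,  7, 6, 6,
                    1, 4, 3,  7, 4, 4,
                    6, 4, 6,  5, 3, 4,
                    8, 7, 4,  6, 9, 3,
                    2, 3, 2, 10, 4, 3),
                  nrow = 6, byrow = TRUE)
n <- nrow(ratings)  # subjects
k <- ncol(ratings)  # wines

grand_mean <- mean(ratings)      # 5.167
wine_means <- colMeans(ratings)  # 4.833, 5.333, 5, 6.833, 4.833, 4.167
subj_means <- rowMeans(ratings)  # 5.333, 7, 3.833, 4.667, 6.167, 4

SS_wine     <- n * sum((wine_means - grand_mean)^2)
SS_within   <- sum(sweep(ratings, 2, wine_means)^2)  # deviations from wine means
SS_subjects <- k * sum((subj_means - grand_mean)^2)
SS_error    <- SS_within - SS_subjects

MS_wine  <- SS_wine / (k - 1)
MS_error <- SS_error / ((n - 1) * (k - 1))
F_value  <- MS_wine / MS_error

round(c(SS_wine = SS_wine, SS_error = SS_error, F = F_value), 3)
# SS_wine = 24.333, SS_error = 88, F = 1.383
```

Removing SSsubjects from the error term is exactly where the repeated-measures design gains its power: the between-subject variability (e.g. Ulkar rating everything higher than Francesco) no longer counts against the wine effect.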
There are a variety of functions to run an ANOVA in R. We will be using those from the afex package! (Any other package is fine as well, but afex has some advantages: sensible contrast definitions, Type 3 sums of squares, etc.)
The main function is aov_car() - so called because it runs the Anova() function from the car package. aov_ez() and aov_4() are wrappers around aov_car(): they produce the same result, but the input format/syntax differs.
Let’s use a bigger data set:
With the afex package, we can use the same function to calculate a one-way ANOVA, factorial ANOVA, repeated measures ANOVA, or mixed ANOVA. To do so, we always have to specify the Error() term.
E.g. one-way & factorial ANOVA:
aov1 <- z %>%
  aov_car(T_1 ~ condition + Error(ID),
          data = .)
aov1
# using different syntax:
aov2 <- z %>%
  aov_car(T_1 ~ condition * gender + Error(ID),
          data = .)

aov2.1 <- z %>%
  select(T_1, condition, gender, ID) %>%
  aov_ez(id = "ID",
         dv = "T_1",
         between = c("condition", "gender"),
         data = .)

aov2.2 <- z %>%
  aov_4(T_1 ~ condition * gender + (1|ID),
        data = .)
aov2.2

Repeated measures ANOVA with afex
For a rmANOVA (in R with the afex package), we need the data to be in long format:
z_long <- z %>%
  pivot_longer(cols = starts_with("T"),
               names_to = "timepoint")

# calculate ANOVA
aov_rm <- z_long %>%
  aov_car(value ~ 1 + Error(ID/timepoint),
          data = .)

# I prefer the formula notation:
aov_rm <- z_long %>%
  aov_4(value ~ 1 + (timepoint|ID),
        data = .)
aov_rm

Anova Table (Type 3 tests)
Response: value
Effect df MSE F ges p.value
1 timepoint 1.99, 595.97 1.06 0.28 <.001 .754
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
Sphericity correction method: GG
We can simply extend our model by adding between-subjects variables:
Anova Table (Type 3 tests)
Response: value
Effect df MSE F ges p.value
1 condition 2, 296 0.94 304.08 *** .387 <.001
2 gender 1, 296 0.94 0.32 <.001 .573
3 timepoint 1.99, 590.22 1.06 0.28 <.001 .756
4 condition:timepoint 3.99, 590.22 1.06 1.23 .006 .297
5 gender:timepoint 1.99, 590.22 1.06 0.24 <.001 .786
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
Sphericity correction method: GG
Unfortunately, we can’t use performance::check_model() (yet) to check the model fit assumptions with afex. This means that we have to check the assumptions individually:
continuous DV
(outliers)
sphericity (the variances of the differences between all levels must be equal)
normality of residuals
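The sphericity assumption can be made concrete with the wine data from above: for every pair of levels of the within factor, compute the variance of the difference scores; sphericity holds when these variances are roughly equal. A base-R sketch (in practice you would rely on Mauchly’s test, and afex applies a Greenhouse-Geisser correction when sphericity is violated):

```r
# wine ratings: 6 subjects x 6 wines (same data as in the hand-calculation example)
ratings <- matrix(c(5, 6, 7,  6, 3, 5,
                    7, 8, 8,  7, 6, 6,
                    1, 4, 3,  7, 4, 4,
                    6, 4, 6,  5, 3, 4,
                    8, 7, 4,  6, 9, 3,
                    2, 3, 2, 10, 4, 3),
                  nrow = 6, byrow = TRUE)

# variance of the difference scores for every pair of levels
pairs_idx <- combn(ncol(ratings), 2)
diff_vars <- apply(pairs_idx, 2,
                   function(p) var(ratings[, p[1]] - ratings[, p[2]]))
names(diff_vars) <- apply(pairs_idx, 2, paste, collapse = "-")

# sphericity assumption: these 15 variances should be roughly equal
round(diff_vars, 2)
```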
OK: Data seems to be spherical (p > 0.641).
We can use the emmeans package to calculate “estimated marginal means”:
We can use those to calculate pairwise contrasts:
contrast estimate SE df t.ratio p.value
T_1 A - T_2 A -0.0166 0.142 296 -0.117 1.0000
T_1 A - T_3 A 0.1751 0.148 296 1.181 0.9601
T_1 A - T_1 B 1.0370 0.134 296 7.735 <.0001
T_1 A - T_2 B 1.0652 0.138 296 7.709 <.0001
T_1 A - T_3 B 0.9146 0.143 296 6.382 <.0001
T_1 A - T_1 C 1.9918 0.135 296 14.712 <.0001
T_1 A - T_2 C 2.1386 0.139 296 15.423 <.0001
T_1 A - T_3 C 1.9307 0.144 296 13.423 <.0001
T_2 A - T_3 A 0.1916 0.147 296 1.307 0.9286
T_2 A - T_1 B 1.0536 0.138 296 7.626 <.0001
T_2 A - T_2 B 1.0818 0.142 296 7.642 <.0001
T_2 A - T_3 B 0.9312 0.147 296 6.341 <.0001
T_2 A - T_1 C 2.0084 0.139 296 14.491 <.0001
T_2 A - T_2 C 2.1551 0.143 296 15.076 <.0001
T_2 A - T_3 C 1.9473 0.147 296 13.207 <.0001
T_3 A - T_1 B 0.8619 0.143 296 6.015 <.0001
T_3 A - T_2 B 0.8901 0.147 296 6.061 <.0001
T_3 A - T_3 B 0.7396 0.151 296 4.885 0.0001
T_3 A - T_1 C 1.8168 0.144 296 12.643 <.0001
T_3 A - T_2 C 1.9635 0.147 296 13.324 <.0001
T_3 A - T_3 C 1.7557 0.153 296 11.484 <.0001
T_1 B - T_2 B 0.0282 0.142 296 0.199 1.0000
T_1 B - T_3 B -0.1224 0.148 296 -0.825 0.9961
T_1 B - T_1 C 0.9548 0.136 296 7.044 <.0001
T_1 B - T_2 C 1.1016 0.139 296 7.942 <.0001
T_1 B - T_3 C 0.8937 0.144 296 6.212 <.0001
T_2 B - T_3 B -0.1506 0.147 296 -1.026 0.9831
T_2 B - T_1 C 0.9266 0.139 296 6.683 <.0001
T_2 B - T_2 C 1.0734 0.143 296 7.500 <.0001
T_2 B - T_3 C 0.8655 0.147 296 5.868 <.0001
T_3 B - T_1 C 1.0772 0.144 296 7.494 <.0001
T_3 B - T_2 C 1.2239 0.147 296 8.302 <.0001
T_3 B - T_3 C 1.0161 0.153 296 6.639 <.0001
T_1 C - T_2 C 0.1467 0.143 296 1.028 0.9829
T_1 C - T_3 C -0.0611 0.149 296 -0.409 1.0000
T_2 C - T_3 C -0.2078 0.148 296 -1.406 0.8948
Results are averaged over the levels of: gender
P value adjustment: tukey method for comparing a family of 9 estimates
There are a number of ways to specify which contrasts you want in emmeans. Here are two examples: one using a predefined contrast method (trt.vs.ctrlk) and one with manually defined contrasts.
# compare everything to the "last" level, in this case T_3:
mixed_mod %>%
  emmeans(specs = trt.vs.ctrlk ~ timepoint)

###
# get emmeans
emm1 <- emmeans(mixed_mod, specs = ~ timepoint)
emm1

# we need to get the correct order:
T_1_c <- c(1, 0, 0)
T_2_c <- c(0, 1, 0)
T_3_c <- c(0, 0, 1)
T_23_c <- c(0, .5, .5)

# get custom contrasts from a list
contrast(emm1, method = list("timepoint 1 - timepoint 3" = T_1_c - T_3_c,
                             "timepoint 1 - timepoint 2" = T_1_c - T_2_c,
                             "tp 1 - mean of tp 2 & 3"   = T_1_c - (T_2_c + T_3_c)/2,
                             "tp 1 - mean of tp 2 & 3 (same contrast)" = T_1_c - T_23_c))

# the same is possible with interactions/several factors
emm2 <- emmeans(mixed_mod, specs = ~ c(timepoint, condition))
emm2
T_1_A <- c(1, 0, 0, 0, 0, 0, 0, 0, 0) # etc.

Sometimes you have a continuous IV in combination with a factorial IV - e.g. if you want to further investigate their interaction. Here, you can’t easily work with contrasts (which value of the continuous variable would you take? The mean?). Instead, you can get the slope (of the predicted values) for the continuous IV within each level of the categorical factor. This is called a simple slope. For this aim, you would use the function emtrends().
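The idea behind simple slopes can be sketched in base R with a simulated data set (all variable names and parameter values below are invented; emtrends() does this for you, plus proper standard errors and tests):

```r
set.seed(42)

# simulated data: continuous IV x, factor group, interaction in the true model
n <- 300
group <- factor(rep(c("A", "B", "C"), each = n / 3))
x <- rnorm(n)
slopes_true <- c(A = 0.5, B = 1.0, C = -0.5)  # a different slope per group
y <- slopes_true[as.character(group)] * x + rnorm(n)

fit <- lm(y ~ x * group)

# simple slope of x within each group, recovered from the coefficients
# (treatment contrasts: group A is the reference level)
b <- coef(fit)
simple_slopes <- c(A = unname(b["x"]),
                   B = unname(b["x"] + b["x:groupB"]),
                   C = unname(b["x"] + b["x:groupC"]))
round(simple_slopes, 2)  # close to the true slopes 0.5, 1.0, -0.5
```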
(We will cover this later, when we talk about Linear Mixed Models - because you would probably not expect an interaction between a factor/IV and a cont. covariate…)
Next time? May 14th?
Topic? LMM? :D